Abstract


A volcano is a complex system, and the characterization of its state at any given
time is not an easy task. Monitoring data can be used to estimate the probability
of an unrest and/or an eruption episode. These can include seismic, magnetic,
electromagnetic, deformation, infrasonic, thermal and geochemical data or, in an ideal
situation, a combination of them. Merging data of different origins is a non-trivial
task, and often even extracting a few relevant and information-rich parameters from
a homogeneous time series is already challenging. The key to the characterization
of volcanic regimes is in fact a process of data reduction that should produce
a relatively small vector of features. The next step is the interpretation of the
resulting features, through the recognition of similar vectors and, for example,
their association with a given state of the volcano. This can in turn highlight
possible precursors of unrest and eruptions. This final step can benefit from the
application of machine learning techniques, which are able to process big data
efficiently. Other applications of machine learning in volcanology include the
analysis and classification of geological, geochemical and petrological “static” data
to infer, for example, the possible source and mechanism of observed deposits, and the
analysis of satellite imagery to quickly classify vast regions that are difficult to investigate
on the ground or, again, to detect changes that could indicate an unrest. Moreover,
the use of machine learning is gaining importance in other areas of volcanology,
not only for monitoring purposes but also for differentiating particular geochemical
patterns, addressing stratigraphic issues, differentiating morphological patterns of volcanic
edifices, or assessing the spatial distribution of volcanoes. Machine learning is helpful
in the discrimination of magmatic complexes, in distinguishing tectonic settings of
volcanic rocks and in the evaluation of correlations of volcanic units, being particularly
helpful in tephrochronology. In this chapter we review the relevant methods
and results published in the last decades using machine learning in volcanology,
both with respect to the choice of the optimal feature vectors and to their subsequent
classification, taking into account both the unsupervised and the supervised
approaches.


Keywords: machine learning, volcano seismology, volcano geophysics, 
volcano geochemistry, volcano geology, data reduction, feature vectors


1. Introduction


Pyroclastic density currents, debris flow avalanches, lahars and ash falls can
dramatically affect the lives of people living close to volcanoes, and other volcanic products
such as lava flows can severely affect properties and infrastructure. Several volcanoes
lie close to highly populated areas, and the economic impact of their eruptions could be
very strong. Stochastic forecasts of volcanic eruptions are difficult [1, 2],
but deterministic forecasts (i.e., specifying when, where, how an eruption will occur) 
are even harder. Many volcanoes are monitored by observatories that try to estimate 
at least the probability of the different hazardous volcanic events [3]. Different time 
series can be monitored and hopefully used for forecasting, including seismic data 
[4], geomagnetic and electromagnetic data [5], geochemical data [6], deformation 
data [7], infrasonic data [8], gas data [9], thermal data from satellite [10] and from 
the ground [11]. Whenever possible, a multiparametric approach is advisable.
For instance, at Merapi volcano, seismic, satellite radar, ground geodetic and
geochemical data were efficiently integrated to study the major 2010 eruption [12];
a multiparametric approach is also essential to understand shallow processes such as the
ones seen at geothermal systems like Dallol in Ethiopia [13]. Although many time
series may be available, seismic data always remain at the heart of any monitoring
system, and should always include the analysis of continuous volcanic tremor [14];
tremor has in fact a great potential [15] due to its persistence and memory [1, 2] and 
its sensitivity to external triggering such as regional tectonic events [16] or Earth tides 
[17]. Moreover, its time evolution can be indicative of variations in other parameters, 
such as gas flux [18]. Other information-rich time series can be built looking at the 
time evolution of the number of the different discrete volcano-seismic events that 
can be recorded on a volcano. These include volcano-tectonic (VT) earthquakes, 
rockfall events, long-period (LP) and very-long-period (VLP) events, explosions, 
etc. Counting the overall number of events is not enough: one has to detect them and 
classify them, because they are linked to different processes, as detailed below. For 
this reason it is important to generate automatically different time series for each type 
of volcano-seismic event. 

VT can be described as “normal” earthquakes which take place in a volcanic 
environment and can indicate magma movement [19, 20]. LP events have a great 
potential for forecasting [21]. Their debated interpretation involves the repeated 
expansion and compression of sub-horizontal cracks filled with steam or other 
ash-laden gas [22], stick-slip magma motion [23], fluid-driven flow [24], eddy 
shedding, turbulent slug flow, soda bottle analogues [25], deformation acceleration 
of solidified domes [26] and slow ruptures [27]. Explosion quakes are generated
by sudden magma, ash, and gas extrusion in an explosive event, often associated
with VLP events [28]. In many papers “tremor episodes” (TRE events) are also
described and counted, usually associated with magma degassing [20]. However, a
volcano with any activity produces a continuous “tremor” whose detectability
depends only on the sensitivity of the seismic instrumentation [29, 30]. So, the class “TRE”
would be better defined as “tremor episode that exceeds the detection limits”. Of
course, at volcanoes we can also record natural but non-volcanic seismic signals 
such as far tectonic earthquakes, far explosions, etc., and also anthropogenic signals 
e.g., due to industries, ground vehicles, helicopters used for monitoring, etc. 

Most volcano observatories rely on manual classification and counting of such 
seismic events, which suffers from human subjectivity and can become unfeasible 
during an unrest or a seismic crisis [31, 32]. For this reason, manual classification
should be replaced by automated processing, and this is where machine
learning (ML) comes into play. The same reasoning of course also applies to the
automated processing of other monitoring time series, such as deformation, gas and
water geochemistry, etc. Moreover, ML in volcanology is not restricted to monitoring
active volcanoes but has also proved useful when dealing with other
large datasets. Examples include correlating volcanic units in general e.g., [33],
tephra e.g., [34, 35] and ignimbrites e.g., [36], a task which may become very
difficult, especially when many deposits of similar ages and geochemical and
petrographic characteristics crop out in a given area. ML is also effective for
discriminating tectonic settings of volcanic rocks [34, 37]. Recently it has also been used
for the prediction of trace elements in volcanic rocks [38].


2. Machine learning 


ML is a field of computer science dedicated to the development of algorithms
which are based on a collection of examples of some phenomenon. These examples
can be natural, human-generated or computer-generated. From another point of 
view, ML can be seen as the process of solving a problem by building and using 
a statistical model based on an existing dataset [39]. ML can also be defined as 
the study of algorithms that allow computer programs to automatically improve 
through experience [40]. ML is only one of the ways we expect to achieve Artificial 
Intelligence (AI). AI has in fact a wider, dynamic and fuzzier definition, e.g., 
Andrew Moore, former Dean of the School of Computer Science at Carnegie Mellon 
University, defined it as “the science and engineering of making computers behave 
in ways that, until recently, we thought required human intelligence”. ML is usually 
characterized by a series of steps: data reduction, model training, model evaluation, 
model final deployment for classification of new, unknown data (see Figure 1). The 
training (which is the proper learning phase) can be supervised, semi-supervised, 
unsupervised or based on reinforcement. 

More data does not necessarily imply better results. Low quality and irrelevant data 
can instead lead to worse classification performances. If for each datum we have a very 
high number of columns, we may wonder how many of those are really informative. 

A number of techniques can help us with this process of data reduction. The simplest 
include column variance estimations and evaluating correlations between columns. 
Each of the components of the vector that “survive” this phase is called a feature and is 
supposed to describe somehow the data item, hopefully in a way that makes it easier to 
associate the item to a given class. There are dimensionality reduction algorithms [41]
where the output is a simplified feature vector that is (almost) equally good at describing
the data. There are many techniques to find a smaller number of independent
features, such as Independent Component Analysis (ICA) [42], Non-negative Matrix
Factorization (NMF) [43], Singular Value Decomposition [44], Principal Component
Analysis (PCA) [45] and Auto-encoders [46]. Linear Discriminant Analysis (LDA)
[47] uses the training samples to estimate the between-class and within-class scatter
matrices, and then employs the Fisher criterion to obtain the projection matrix for
feature extraction (or feature reduction).

Figure 1.

ML can be divided into several steps, from top to bottom. Raw data first have to be reduced by extracting short
and information-rich feature vectors. These can then be used to build models that are trained, analyzed and
finally used for classification of new data. The [labels] are present only in a (semi-)supervised approach.
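
The two simplest data reduction steps mentioned above (column variance and correlation between columns) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the thresholds and the toy table are our assumptions, not a procedure taken from the cited works.

```python
from statistics import mean, pvariance

def pearson(a, b):
    """Pearson correlation coefficient between two equally long columns."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return cov / norm if norm > 0 else 0.0

def select_columns(rows, min_variance=1e-6, max_correlation=0.95):
    """Return the indices of columns that are neither near-constant nor redundant.

    Both thresholds are arbitrary illustrative values.
    """
    columns = list(zip(*rows))
    kept = []
    for j, col in enumerate(columns):
        if pvariance(col) < min_variance:
            continue  # near-constant column: carries no information
        if any(abs(pearson(col, columns[k])) > max_correlation for k in kept):
            continue  # almost a copy of a column that is already kept
        kept.append(j)
    return kept

# Toy table: column 1 is twice column 0, column 2 is constant.
rows = [
    [1.0, 2.0, 5.0, 0.3],
    [2.0, 4.0, 5.0, 0.9],
    [3.0, 6.0, 5.0, 0.1],
    [4.0, 8.0, 5.0, 0.7],
]
kept = select_columns(rows)
features = [[row[j] for j in kept] for row in rows]  # reduced feature vectors
```

Here only columns 0 and 3 survive; methods such as PCA or ICA go further by building new, combined features instead of just discarding columns.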

In supervised learning, the dataset is a collection of example couples of the
type (data, label), $\{(x_i, y_i)\}_{i=1}^{N}$. Each element $x_i$ is called a feature vector and has a
companion label $y_i$. In the supervised learning approach the dataset is used to
derive a model that takes a feature vector as input and outputs a label that should
describe it. For example, the feature vector of volcano-seismic data could contain
several amplitude-based, spectral-based, shape-based or dynamical parameters and
the label to be assigned could be one of those described above, i.e., VT, LP, VLP. In a
volcanic geochemical example, feature vectors could contain major elements weight
percentages, and labels the corresponding rock type. The reliability of the labels is 
often the most critical issue of the setup of a supervised ML classification scheme. 
Labels should therefore be assigned carefully by experts. In general, it is much 
better to have relatively few training events with reliable labels than to have many 
more, but not so reliable, labeled examples. 
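
As a minimal sketch of the supervised setting, the following Python code learns a nearest-centroid classifier from a handful of (feature vector, label) couples. The classifier choice, the two features (dominant frequency in Hz, duration in s) and all numerical values are invented for illustration and do not come from any real catalog.

```python
from collections import defaultdict
from math import dist

def fit_centroids(dataset):
    """Average the feature vectors of each label (the 'training' step)."""
    grouped = defaultdict(list)
    for x, y in dataset:
        grouped[y].append(x)
    return {label: [sum(col) / len(col) for col in zip(*vectors)]
            for label, vectors in grouped.items()}

def predict(centroids, x):
    """Assign the label whose centroid is closest to the feature vector x."""
    return min(centroids, key=lambda label: dist(centroids[label], x))

# Hypothetical labeled examples: ([dominant frequency, duration], label)
training = [
    ([8.0, 12.0], "VT"), ([9.5, 10.0], "VT"), ([7.5, 15.0], "VT"),
    ([1.5, 30.0], "LP"), ([2.0, 25.0], "LP"), ([1.0, 35.0], "LP"),
]
centroids = fit_centroids(training)
label = predict(centroids, [1.8, 28.0])  # a new, unlabeled event
```

In a real application the features would be standardized first, since here the duration dominates the distance simply because of its larger numerical range.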

In unsupervised learning, the dataset is a collection of examples without any
labeling, i.e., containing only the data $\{x_i\}_{i=1}^{N}$. As in the previous case, each $x_i$ is
a feature vector, and the goal is to create a model that maps a feature vector $x$ into a
value (or another vector) that can help solving a problem. Typical examples are all 
the clustering procedures, where the output is the cluster number to which each 
datum belongs. The choice of the best features to use is a difficult one, and several 
techniques of Unsupervised Feature Selection were proposed, with the capability of 
identifying and selecting relevant features in unlabeled data [48]. Unsupervised 
outlier detection methods [49] can also be used, where the output indicates if a 
given feature vector is likely to describe a “normal” or “anomalous” member of the 
dataset. 

The semi-supervised learning approach stands somehow in the middle, and 
the dataset contains both labeled (usually a few) and unlabeled (usually many 
more) feature vectors. The basic idea is similar to supervised learning, but with the 
possibility to exploit also the presence of (many more) unlabeled examples in the 
training phase. 

In reinforcement learning, the machine is “embedded” in an environment,
whose state is again described by a feature vector. In each state the machine can
execute actions, which produce different rewards and can cause an environmental 
state transition. The goal in this case is to learn a policy, i.e., a function or model 
that takes the feature vector as input and outputs an optimal action to execute 
in that state. The action is optimal if it maximizes the expected average reward. 
We can also say that reinforcement learning is a behavioral learning model. The 
algorithm receives feedback from the data analysis, guiding the user to the best 
outcome. Here the main point is that the system is not trained with a sample 
dataset but learns through trial and error. Therefore, a sequence of successful 
decisions will result in that process being reinforced, because it best solves the 
problem at hand. Problems that can be tackled with this approach are the ones 
where decision making is sequential, and the goal is long-term, such as game 
playing, robotics, resource management, or logistics. Time is therefore explicitly 
used here, contrary to other approaches, in which in most of the cases data items 
are analyzed one by one without taking into account the time order in which 
they arrive. 
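
The idea of learning a policy by trial and error can be illustrated with tabular Q-learning on a toy environment. The sketch below is our own illustrative setup, not an application from the literature reviewed here: a chain of five states in which only reaching the last one is rewarded, a purely random exploration strategy, and a discounted (rather than average) reward criterion are all assumptions.

```python
import random

N_STATES = 5            # states 0..4; state 4 is terminal and rewarding
ACTIONS = (0, 1)        # 0 = step left, 1 = step right
GAMMA, ALPHA = 0.9, 0.5  # discount factor and learning rate (illustrative values)

def step(state, action):
    """Environment: deterministic move along the chain, reward 1 at the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            # Q-learning is off-policy, so a uniformly random behavior is enough here
            action = 0 if rng.random() < 0.5 else 1
            nxt, reward = step(state, action)
            target = reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q

q_table = train()
# Greedy policy: the best action in each non-terminal state
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

The learned policy is to step right in every state, which was never given as a labeled example but emerges from the rewards alone.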

In some domains (and volcanology is a good example) training data are scarce. 
In this case we can profit from knowledge acquired in another domain using 
techniques known as Transfer Learning (TL) [50]. The basic idea here is to train
a model in one domain with abundant data (original domain) and then use it as
a pretrained model in a different domain (with less data). A fine-tuning phase
then follows, using the domain-specific data available in the target domain. This
approach was applied for instance at Volcan de Fuego de Colima (Mexico) [51],
Mount St. Helens (USA) and Bezymianny (Russia) [52].

Among the computer languages that are most used for implementing ML 
techniques we can cite Python [53], R [54], Java [55], Javascript [56], Julia [57] and 
Scala [58]. Many dedicated, open source libraries are available for each of them, and 
many computer codes, also specialized for volcanic and geophysical data, can be 
found in open access repositories such as GitHub [59]. 


3. Machine learning techniques 


Extracted feature vectors can become inputs to several different techniques 
of machine learning. We can cite among others Cluster Analysis (CA) [60], 
Self-Organizing Maps (SOM) [61-63], Artificial Neural Networks (ANN) and 
Multi Layer Perceptrons (MLP) [64-66], Support Vector Machines (SVM) [67], 
Convolutional Neural Networks (CNN) [51], Recurrent Neural Networks (RNN) 
[68], Hidden Markov Models (HMM) [3, 31, 69-71] and their Parallel System 
Architecture (PSA) based on Gaussian Mixture Models (GMM) [72]. 

CA (Figure 2a) is an unsupervised learning approach aimed at grouping similar 
data while separating different ones, where similarity is measured quantitatively 
using a distance function in the space of feature vectors. The clustering algorithms 
can be divided into hierarchical and non-hierarchical. In the former a tree-like 
structure is built to represent the relations between clusters, while in the latter new 
clusters are formed by merging or splitting existing ones without following a tree- 
like structure but just grouping the data in order to maximize or minimize some 
evaluation criteria. CA includes a vast class of algorithms, including e.g., K-means, 
K-medians, Mean-shift, DBSCAN, Expectation-Maximization (EM), Clustering
using Gaussian Mixture Models (GMM), Agglomerative Hierarchical, Affinity 
Propagation, Spectral Clustering, Ward, Birch, etc. Most of these methods are 
described and implemented in the open-source Python package scikit-learn [73]. 
The use of six different unsupervised, clustering-based methods to classify volcano 
seismic events was explored at Cotopaxi Volcano [32]. One of the most difficult 
issues is the choice of the number of clusters into which the data should be divided; 
this number in most of the cases has in fact to be fixed a priori before running 
the code. Several techniques exist in order to help with this choice, such as elbow, 
silhouette, gap statistics, heuristics, etc. Many of them are described and included 
in the R package NbClust [74]. Problems arise when the estimates that each of them 
provides are contradictory. 

Another approach to unsupervised classification is SOM (Figure 2b) or
Kohonen maps [75, 76], a type of ANN trained to produce a low dimensional,
usually 2D, discretized representation of the feature vector space. The training is
based on competitive and collaborative learning, using a neighborhood function to
preserve the input topological properties.

A very common type of ANN, often used for supervised classification, is MLP,
which consists of at least three layers of nodes (Figure 2c): an input layer, (at least)
one hidden layer and an output layer. Nodes use nonlinear activation functions and
are trained through the backpropagation mechanism. If the number of hidden layers
of an ANN becomes very high, we talk of Deep Neural Networks (DNN), which are
also used mainly in a supervised fashion. Among DNN, the CNN (Figure 2d) contain
at least some convolutional layers, which convolve their inputs with a multiplication or
other dot product. The activation function in the case of CNN is commonly a rectified
linear unit (ReLU), and there are also pooling layers, fully connected layers and
normalization layers.

Figure 2.

Schematic illustration of some of the ML techniques described in the text. (a) Cluster analysis in its hierarchical
and non-hierarchical versions. (b) Self-organizing maps. (c) Multilayer perceptron. (d) Convolutional neural
network.
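
The building blocks of a convolutional layer can be illustrated on a one-dimensional signal. The following sketch only shows a forward pass (convolution, ReLU, max pooling) with a hand-picked kernel; in a real CNN the kernel weights are learned by backpropagation, and the signal and kernel used here are arbitrary assumptions.

```python
def conv1d(signal, kernel):
    """'Valid' sliding dot product, which is what CNN convolutional layers compute."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]

def relu(values):
    """Rectified linear unit: negative responses are set to zero."""
    return [max(0.0, v) for v in values]

def max_pool(values, size):
    """Keep the maximum of each non-overlapping window (trailing samples are dropped)."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

signal = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, -2.0, 0.0, 0.0]
kernel = [-1.0, 0.0, 1.0]  # hand-picked filter that responds to rising amplitudes
feature_map = max_pool(relu(conv1d(signal, kernel)), size=2)
```

The resulting feature map is `[3.0, 0.0, 0.0]`: the filter responds to the onset of the pulse and stays silent elsewhere.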

An RNN is a type of ANN with a feedback loop (Figure 3a), in which neuron
outputs can also be used as neuron inputs in the same layer, allowing some
information to be maintained during the training process. Long Short Term Memory networks
(LSTM) are a subset of RNN capable of learning long-term dependencies [77] and
of better remembering information for long periods of time. RNN can be used for both
supervised and unsupervised learning.
Figure 3.
Schematic illustration of some of the ML techniques described in the text. (a) Recurrent neural network 
(b) logistic regression (c) support vector machine (d) random forest (e) hidden Markov model. 


Logistic regression (LR) (Figure 3b) is a supervised generalized linear model, 
i.e., the classification (probability) dependence on the features is linear [78]. In order 
to avoid the problems linked to high dimensionality of the data, techniques such as 
the Least Absolute Shrinkage and Selection Operator (LASSO) can be applied to 
reduce the number of dimensions of the feature vectors which are input to LR [79]. 
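
The effect of an L1 (LASSO-type) penalty can be sketched with a small logistic regression trained by proximal gradient descent. Note that this is a variant chosen for compactness, in which the penalty acts directly on the LR weights rather than as a separate preprocessing step, and that the synthetic data, the penalty strength and the learning rate are all our assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, amount):
    """Shrink towards zero; values smaller than `amount` become exactly zero."""
    return math.copysign(max(abs(value) - amount, 0.0), value)

def fit_l1_logistic(X, y, lam=0.05, lr=0.5, epochs=300):
    """Logistic regression with an L1 penalty of strength lam on the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - yi
                  for x, yi in zip(X, y)]
        grad_b = sum(errors) / n
        grad_w = [sum(e * x[j] for e, x in zip(errors, X)) / n for j in range(d)]
        b -= lr * grad_b
        # gradient step followed by the L1 proximal (shrinkage) step
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b

# Synthetic data: feature 0 separates the classes, feature 1 is unrelated to them.
X = ([(1.0 + 0.1 * k, 0.5 * (-1) ** k) for k in range(10)]
     + [(-1.0 - 0.1 * k, 0.5 * (-1) ** k) for k in range(10)])
y = [1] * 10 + [0] * 10
w, b = fit_l1_logistic(X, y)
predicted = [int(sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) > 0.5) for x in X]
```

The weight of the uninformative feature is driven to zero while the informative one is retained, which is the feature-selection behavior that motivates the use of LASSO with high-dimensional feature vectors.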

SVM (Figure 3c) constitute a supervised statistical learning framework [80]. It 
is most commonly used as a non-probabilistic binary classifier. Examples are seen as 
points in space, and the aim is to separate categories by a gap that is as wide as possible. 
Unknown samples are then assigned to a category based on the side of the gap on 
which they fall. In order to perform a non-linear classification, data are mapped into 
high-dimensional feature spaces using suitable kernel functions. 
Sparse Multinomial Logistic Regression (SMLR) is a class of supervised methods
for learning sparse classifiers that incorporate weighted sums of basis functions
with sparsity-promoting priors encouraging the weight estimates to be either
significantly large or exactly zero [81]. The sparsity concept is similar to the one at
the base of Non-negative Matrix Factorization (NMF) [82]. The sparsity-promoting
priors result in an automatic feature selection, which helps to mitigate the
so-called “curse of dimensionality”. So, sparsity in the kernel basis functions and
automatic feature selection can be achieved at the same time [83]. SMLR methods
control the capacity of the learned classifier by minimizing the number of basis 
functions used, resulting in better generalization. There are fast algorithms for 
SMLR that scale favorably in both the number of training samples and the feature 
dimensionality, making them applicable even to large data sets in high-dimensional 
feature spaces. 

A Decision Tree (DT) is an acyclic graph. At each branching node, a specific
feature $x_j$ is examined. The left or right branch is followed depending on the value
of $x_j$ in relation to a given threshold. A class is assigned to each datum when a leaf
node is reached. As usual, a DT can be learned from labeled data, using different
strategies. In the DT class we can mention Best First Decision Tree (BFT), 
Functional Tree (FT), J48 Decision Tree (J48DT), Naïve Bayes Tree (NBT) and 
Reduced Error Pruning Trees (REPT). Ensemble learning techniques such as 
Random SubSpace (RSS) can be used to combine the results of the different 
trees [84]. 
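
The way a trained DT classifies a datum can be written down in a few lines. In the sketch below the tree is given by hand as nested dictionaries; the two features (dominant frequency in Hz, duration in s), the thresholds and the leaf labels are hypothetical and only serve to show the traversal from the root to a leaf.

```python
# Hypothetical tree: feature 0 = dominant frequency (Hz), feature 1 = duration (s)
tree = {
    "feature": 0, "threshold": 5.0,
    "left": {"label": "LP"},
    "right": {
        "feature": 1, "threshold": 20.0,
        "left": {"label": "VT"},
        "right": {"label": "rockfall"},
    },
}

def classify(node, x):
    """Follow the left or right branch at each node until a leaf is reached."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["label"]

labels = [classify(tree, x) for x in ([2.0, 30.0], [8.0, 10.0], [8.0, 40.0])]
```

Learning a DT amounts to choosing, from labeled data, which feature and threshold to test at each node.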

The Boosting concept, a kind of ensemble meta-algorithm mostly (but not only)
associated with supervised learning, uses the original training data to iteratively create
multiple models by using a weak learner. Each model is different from the
previous one, as the weak learners try to “fix” the errors made by previous models.
An ensemble model then combines the results of the different weak models. On
the other hand, Bootstrap aggregating, also known by the contracted name Bagging,
consists of creating many “almost-copies” of the training data (each copy is slightly
different from the others), then applying a weak learner to each copy and finally
combining the results. A popular and effective algorithm based on bagging is Random
Forest (RF). Random Forest (Figure 3d) is different from the standard bagging in
just one way. At each learning step, a random subset of the features is chosen; this 
helps to minimize correlation of the trees, as correlated predictors are not efficient 
in improving classification accuracy. Particular attention has to be taken in order to 
best choose the number of trees and the size of the random feature subsets. 
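
Bagging with random feature subsets can be sketched with the weakest possible tree, a one-split "stump". The code below is a didactic simplification of RF under our own assumptions (stumps instead of full trees, a split chosen by counting training errors, invented data and labels); it only illustrates bootstrap sampling, random feature selection and majority voting.

```python
import random
from collections import Counter

def fit_stump(X, y, features):
    """Best single threshold split (fewest training errors) among `features`."""
    best = None
    for j in features:
        for t in sorted({x[j] for x in X}):
            left = [yi for x, yi in zip(X, y) if x[j] < t]
            right = [yi for x, yi in zip(X, y) if x[j] >= t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority label
        majority = Counter(y).most_common(1)[0][0]
        return features[0], float("-inf"), majority, majority
    return best[1:]

def fit_forest(X, y, n_trees=25, n_sub=1, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap "almost-copy"
        feats = rng.sample(range(d), n_sub)         # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(l if x[j] < t else r for j, t, l, r in forest)
    return votes.most_common(1)[0][0]

X = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.0), (1.2, 1.1),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2)]
y = ["VT"] * 4 + ["rockfall"] * 4
forest = fit_forest(X, y)
```

With `n_trees` and `n_sub` playing the role of the number of trees and of the size of the random feature subsets, `predict(forest, (0.5, 0.5))` and `predict(forest, (6.0, 6.0))` fall on opposite sides of every learned threshold and are assigned to different classes.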

A Hidden Markov Model (HMM) (Figure 3e) is a statistical model in which the
system being modeled is assumed to be a Markov process. It describes a sequence
of possible events for which the probability of each event depends only on the state
occupied in the previous event. The states are unobservable (“hidden”) but at each
state the model emits a “message” which depends probabilistically on the current
state. Applications are wide in scope, from reinforcement learning to temporal pattern
recognition, and the approach works well when time is important; speech [85],
handwriting and gesture recognition are therefore typical fields of application, as is
volcano seismology [69, 86].
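
How the hidden states are recovered from the emitted "messages" can be shown with the Viterbi algorithm on a two-state toy model. All probabilities below are invented for illustration; in practice they are estimated from labeled or unlabeled training sequences.

```python
from math import log

# Hypothetical two-state model: hidden states emit "low" or "high" amplitudes
states = ("noise", "event")
start = {"noise": 0.9, "event": 0.1}
trans = {"noise": {"noise": 0.8, "event": 0.2},
         "event": {"noise": 0.2, "event": 0.8}}
emit = {"noise": {"low": 0.9, "high": 0.1},
        "event": {"low": 0.2, "high": 0.8}}

def viterbi(observations):
    """Most likely sequence of hidden states, computed with log-probabilities."""
    score = {s: log(start[s]) + log(emit[s][observations[0]]) for s in states}
    back = []
    for obs in observations[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log(trans[p][s]))
            score[s] = prev[best] + log(trans[best][s]) + log(emit[s][obs])
            pointers[s] = best
        back.append(pointers)
    path = [max(states, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

path = viterbi(["low", "low", "high", "high", "high", "low", "low"])
```

The decoded path marks the three central samples as "event" and the others as "noise", i.e., detection (start and end of the event) and classification are obtained in the same step.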


4. Applications to seismo-volcanic data 


Eruptions are usually preceded by some kind of change in seismicity, making
seismic data one of the key datasets in any attempt to forecast volcanic activity [4].
As we mentioned before, manual detection and classification of discrete events
can be very time consuming, up to becoming unfeasible during a volcanic crisis.
An automatic classification procedure therefore becomes highly valuable, also as a
first step towards forecasting techniques such as the material Failure Forecast Method
(FFM) [87, 88]. Feature vectors should be built in order to provide as much information
as possible about the source, minimizing e.g., path and site effects. In many cases features
can be independent from a specific physical model describing a phenomenon. This
allows ML to work well even when there is no scientific agreement on the generation
of a given seismic signal. A good example in volcano seismology is given by the LP
events. Standardizing data, making them independent from unwanted variables,
is also in general a convenient approach [31]. Time-domain and spectral-based
amplitudes, spectral phases, auto- and cross-correlations, statistical and dynamical
parameters have been considered as the output of data reduction procedures that
can be included into feature vectors [14]. In the literature, these have included linear
predictive coding for spectrograms [66], wavelet transforms [89], spectral autocorrelation
functions [90], statistical and cepstral coefficients [91]. Extracted feature
vectors then become the input to one or another ML method.
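
As a minimal sketch of such a data reduction, the code below turns a waveform into a vector of three simple features (RMS amplitude, zero-crossing rate and dominant frequency). The choice of features and the synthetic test signal are our assumptions, and the naive discrete Fourier transform is used only to keep the example free of external libraries.

```python
import cmath
import math

def extract_features(samples, rate_hz):
    """Reduce a waveform to a small vector of amplitude and spectral features."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    rms = math.sqrt(sum(c * c for c in centered) / n)
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    # naive DFT (O(n^2)); an FFT would be used on real data
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * i / n)
                    for i, c in enumerate(centered))
        power.append(abs(coeff) ** 2)
    k_peak = 1 + max(range(len(power)), key=power.__getitem__)
    return {
        "rms": rms,
        "zero_crossing_rate": crossings / (n - 1),
        "dominant_frequency_hz": k_peak * rate_hz / n,
    }

# Synthetic 2 s record sampled at 50 Hz containing a 2 Hz oscillation
rate = 50.0
signal = [math.sin(2 * math.pi * 2.0 * i / rate + 0.3) for i in range(100)]
features = extract_features(signal, rate)
```

One hundred samples are thus reduced to three numbers, with the dominant frequency recovered at 2 Hz; real feature vectors typically combine many more parameters of this kind.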

CA is probably the most used class of unsupervised techniques and the applica- 
tions to volcano seismology follow this general rule. Spectral clustering was applied 
e.g., to seismic data of Piton de la Fournaise [60]. The fact that e.g., LP seismic 
signals can be clustered into families indicates that the family members are very 
similar to each other. The existence of similar events implies similar location and 
similar source process, i.e., it means the presence of a source that repeats over time in 
an almost identical way. Clustering data after some kind of normalization forces CA 
algorithms to look for similar shapes, independently of size. If significant variations 
in amplitude are then seen within families, this can indicate that the source processes 
of these events are not only repeatable but also scalable in size, as observed e.g., at 
Soufrière Hills Volcano, Montserrat [92] or at Irazú, Costa Rica [93]. The similarity 
of events in the different classes can then be used to detect other events, e.g., for the
purpose of stacking them and obtaining more accurate phase arrivals; this was done
e.g., at Kanlaon, Philippines [94]. For this purpose, an efficient open-source package
is available, called Repeating Earthquake Detector in Python (REDPy) [95].
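
The notion of waveform similarity that is independent of size can be made explicit with a normalized cross-correlation. The sketch below is not the REDPy algorithm; it is a bare-bones illustration on synthetic waveforms, and the signals and the maximum lag are our assumptions.

```python
import math

def normalize(waveform):
    """Remove the mean and scale to unit energy, so that amplitude is ignored."""
    mean = sum(waveform) / len(waveform)
    centered = [v - mean for v in waveform]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def max_cross_correlation(a, b, max_lag):
    """Largest normalized cross-correlation over a range of time lags (in samples)."""
    a, b = normalize(a), normalize(b)
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        cc = sum(a[i] * b[i + lag] for i in range(len(a)) if 0 <= i + lag < len(b))
        best = max(best, cc)
    return best

# Synthetic events: a decaying oscillation, a five times larger copy of it,
# and an event with a different frequency content
template = [math.exp(-i / 20) * math.sin(2 * math.pi * i / 10) for i in range(60)]
larger = [5.0 * v for v in template]
other = [math.exp(-i / 20) * math.sin(2 * math.pi * i / 4) for i in range(60)]

cc_same_family = max_cross_correlation(template, larger, max_lag=5)
cc_other = max_cross_correlation(template, other, max_lag=5)
```

The scaled copy correlates almost perfectly with the template despite its larger amplitude, whereas the event with different frequency content does not; grouping events whose correlation exceeds a chosen threshold is the basic step behind the definition of families.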

In volcano-seismology SOM were applied e.g., to Raoul Island, New Zealand 
[61]. A hierarchical clustering was applied to results of SOM tremor analysis at 
Ruapehu [62] and Tongariro [96] in New Zealand, using the Scilab environment. 
A similar combined approach was applied in Matlab to Etna volcanic tremor 
[97]. Several geometries of SOM were used, with rectangular or hexagonal 
nearest neighbors cells, planar, toroidal or spherical maps, etc. [61]. The clas- 
sic ANN/MLP approach was applied e.g., to seismic data recorded at Vesuvius 
[66], Stromboli [98], Etna [99], while DNN architectures were applied e.g., to 
Volcan de Fuego, Colima [100]. The use of genetic algorithms for the optimiza- 
tion of the MLP configuration was proposed for the analysis of seismic data of 
Villarrica, Chile [101]. CNN were applied e.g., to Llaima Volcano (Chile) seismic 
data, comparing the results to other methods of classification [102]. RNNs were 
applied, together with other methods, to classify signals of Deception Island 
Volcano, Antarctica [68]. The architectures were trained with data recorded in 
1995-2002 and models were tested on data recorded in 2016-2017, showing good 
generalization accuracy. 

Supervised LR models have been applied in the estimation of landslide suscepti- 
bility [103] and to volcano seismic data to estimate the ending date of an eruption at 
Telica (Nicaragua) and Nevado del Ruiz (Colombia) [104]. SVM were applied many 
times to volcano seismology e.g., to classify volcanic signals recorded at Llaima, 
Chile [105] and Ubinas, Peru [106]. Multinomial Logistic Regression was used, 
together with other methods, to evaluate the feasibility of earthquake prediction 
using 30 years of historical data in Indonesia, also at volcanoes [107]. 
RF was applied to the discrimination of rockfalls and VT recorded at Piton de la
Fournaise in 2009-2011 and 2014-2015. Sixty features were used, and excellent results
were obtained. However, an RF trained with 2009-2011 data did not perform well on
data recorded in 2014-2015, demonstrating how difficult it is to generalize models
even at the same volcano [108]. RF, together with other methods, was recently used
on volcano seismic data with the specific purpose to determine when an eruption 
has ended [104], a problem which is far from being trivial. RF was also used to 
derive ensemble mean decision tree predictions of sudden steam-driven eruptions 
at Whakaari (New Zealand) [109]. 

Most of the methods described so far try to classify discrete seismic events that 
were already extracted from the continuous stream, i.e., already characterized by a 
given start and end. There are therefore in general two separate phases: detection
and classification [106]. Continuous HMM, on the other hand, are able to process
continuous data and can therefore extract and classify in a single, potentially real-time,
step. HMM are finite-state machines and model sequential patterns where
time direction is essential information. This is typical of (volcano) seismic data.
For instance, P waves always arrive before S waves. HMM-based volcanic seismic
data classifiers have therefore been used by many authors [87, 110-113]. HMM are 
also used routinely in some volcano observatories e.g., at Colima and Popocatepetl 
in Mexico [71]. Etna seismic data was processed by HMM applied to characters 
generated by the Symbolic Aggregate approXimation (SAX) which maps seismic 
data into symbols of a given alphabet [114]. HMM can be also combined with 
standardization procedures such as Empirical Mode Decomposition (EMD) when 
classifying volcano seismic data [31]. 
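
The symbolic mapping performed by SAX can be sketched as follows: the series is standardized, averaged over segments (piecewise aggregate approximation) and each segment mean is replaced by a letter according to breakpoints that divide the standard normal distribution into equiprobable regions. The alphabet size of four and the toy series are our assumptions.

```python
from bisect import bisect
from statistics import mean, pstdev

BREAKPOINTS = [-0.67, 0.0, 0.67]  # approximate quartiles of the standard normal
ALPHABET = "abcd"

def sax(series, n_segments):
    """Map a series to a word of n_segments symbols.

    Assumes a non-constant series; samples beyond a whole number of
    segments are ignored.
    """
    mu, sigma = mean(series), pstdev(series)
    z = [(v - mu) / sigma for v in series]
    seg = len(z) // n_segments
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(n_segments)]
    return "".join(ALPHABET[bisect(BREAKPOINTS, v)] for v in paa)

word = sax([0, 1, 2, 3, 4, 5, 6, 7], n_segments=4)
```

A steadily increasing series is mapped to the word `"abcd"`; sequences of such symbols can then be modeled by discrete HMM.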

Another characteristic common to many of the applications published in the 
literature is the fact that feature vectors are extracted from data recorded at a 
single station. There are relatively few attempts to build multi-station classification 
schemes. At Piton de la Fournaise a system based on RF was implemented [115]. At 
the same volcano, a multi-station approach was used to classify tremor measure- 
ments and identify fundamental frequencies of the tremor associated to different 
eruptive behavior [60]. A scalable multi-station, multi-channel classifier, using 
also the empirical mode decomposition (EMD) first proposed by [31] was applied 
to Ubinas volcano (Peru). The principal component analysis is used to reduce the 
dimensionality of the feature vector and a supervised classification is carried out 
using various methods, with SVM obtaining the best performance [116]. Of course, 
with a multi-station approach particular care has to be taken in order to build a 
system which is robust with respect to the loss of one or more seismic stations due to 
volcanic activity or technical failures. 
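
One simple way to obtain robustness with respect to missing stations is to combine single-station classifications by a majority vote over the stations that are actually available. The sketch below is only one possible design, with hypothetical station names and an arbitrary minimum number of stations; it is not the scheme used in the works cited above.

```python
from collections import Counter

def network_label(station_labels, min_stations=2):
    """Majority vote over the stations that delivered a classification.

    Stations that are down are represented by None and simply ignored;
    if too few stations remain, no network-level label is issued.
    """
    available = [lab for lab in station_labels.values() if lab is not None]
    if len(available) < min_stations:
        return None
    return Counter(available).most_common(1)[0][0]

# Hypothetical network in which station STA3 has been lost
votes = {"STA1": "LP", "STA2": "LP", "STA3": None, "STA4": "VT"}
label = network_label(votes)
```

The loss of one station does not prevent a decision here, while a network reduced to a single station returns no label, so that unreliable classifications are not propagated.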

Open source software and open access papers are luckily becoming more and 
more common. If we consider the processing and classification of volcano seismic 
data, several tools are now available for free download and use, especially within 
the Python environment. Among the most popular, we can cite ObsPy [117] and 
MSNoise [118], with which researchers and observatories can easily process large
quantities of continuous seismic data. Once these tools have produced suitable 
feature vectors, we can look for open source software to implement the different ML 
approaches described in this contribution. Many generic ML libraries are available 
e.g., on GitHub [59] but very few are dedicated specifically to the classification of 
volcano seismic data. Among these, we can cite the recent package Python Interface 
for the Classification of Seismic Signals (PICOSS) [119]. It is a graphical, modular
open source software for detection, segmentation and classification of seismic data. 
Modules are independent and adaptable. The classification is currently based on 
two modules that use Frequency Index analysis [120] or a multi-volcano pre-trained 
neural network, in a transfer learning fashion [52]. The concept of a multi-volcano 
recognizer is also at the core of the EU-funded VULCAN.ears project [31, 121]. The
aim is to build an automatic Volcano Seismic Recognition (VSR) system, conceptu- 
ally supervised (as it is based on HMM) but practically unsupervised, because once 
it is trained on a number of volcanoes with labeled sample data, it can be used on 
volcanoes without any previous data in an unsupervised fashion. The idea is in fact 
to build robust models trained on many datasets recorded by different teams on 
different volcanoes, and to integrate these models into the routinely used monitoring
system of any volcano observatory. Also in this case, the open source software
is made freely available; this includes a command interface called PyVERSO [122] 
based on HTK, a speech recognition HMM toolkit [123], a graphical interface called 
geoStudio and a script called liveVSR, able to process real-time data downloaded 
from any online seismic data server [124], together with some pre-trained ML 
models [125]. 

As we mentioned before, in order to train supervised models for classifying 
seismic events, few events with reliable labels are better than many unreliably 
labeled examples. Just to give a rough idea, 20 labeled events per class is a good 
starting point, but a minimum of 50 labeled events per class is recommended. 
Labelling discrete events is enough for many methods, but for approaches like 
HMM, where the concept is to run the classification on continuous data, it is 
essential to have a sufficient number of continuously labeled time periods, in order 
to “show” the classifier enough examples of transition from tremor to a discrete 
event, and then back to tremor. It is important to have many examples also of 
“garbage” events, i.e., events we are not interested in, so that the classifier can rec- 
ognize and discard them. Finally, it is advisable to have a wide variability of events 
within each given class rather than having many very similar events. There is not 
yet an agreement on a single file format to store these labels. As speech recognition 
is much older and more developed than seismic recognition, it is suggested to adopt 
standard labelling formats of that domain, i.e., the transcription MLF files, which 
are normal text files that include for each event the start time, the end time and of 
course the label. These files can be created manually with a simple text editor, or by 
using a program with a GUI, such as geoStudio [124] or Seismo_volcanalysis [126]. 
Other graphical software packages like SWARM [127] use other formats to store 
the labels, such as CSV, but it is always possible to build scripts that convert the 
resulting label files into MLF format, which remains the recommended one. 
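
As an illustration of such a conversion script, the sketch below turns a CSV
table of labels into a single MLF entry. It assumes the HTK conventions for
master label files (a `#!MLF!#` header, a quoted label-file pattern, times
expressed in units of 100 ns, and a lone period closing each entry). The CSV
column names `start_s`, `end_s` and `label` are hypothetical and would have to
be adapted to the file actually exported by the labelling program.

```python
import csv
import io

HTK_TICKS_PER_SECOND = 10_000_000  # HTK label times are in units of 100 ns


def csv_to_mlf(csv_text, lab_name):
    """Convert CSV rows (start_s, end_s, label) into one MLF entry."""
    lines = ["#!MLF!#", f'"*/{lab_name}.lab"']
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = round(float(row["start_s"]) * HTK_TICKS_PER_SECOND)
        end = round(float(row["end_s"]) * HTK_TICKS_PER_SECOND)
        lines.append(f"{start} {end} {row['label']}")
    lines.append(".")  # a lone period closes the entry
    return "\n".join(lines) + "\n"
```

Before using the output for training, it is worth checking that the time units
match those expected by the tool that will read the file.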


5. Applications of machine learning to geochemical data 


ML applications to geochemical data of volcanoes have been increasing in recent
years, although most of them are limited to the use of cluster analysis. CA has been
used for example to identify and quantify mixing processes using the chemistry of
minerals [128], to study volcanic aquifers [129, 130] or to differentiate
magmatic systems e.g., [131]. Platforms used to carry out these analyses include
the Statistical Toolbox in Matlab [132] and the R platform [54]; some geochemical
software packages built on the latter, such as GCDkit [33], include CA. In most ML
analyses on geochemical samples it is common to use whole rock major elements 
and selected trace elements; some applications also include isotopic ratios. Many 
ML applications to geochemical data use more than one technique, frequently 
combining both unsupervised and supervised approaches. 

A combination of SVM, RF and SMLR approaches was used by [37] to account
for variations of geochemical composition of rocks from eight different tectonic
settings. The authors note that SVM, as used by [34] to discriminate tectonic
settings, is a powerful tool. The RF approach is shown to have the advantage, with
respect to SVM, of providing the importance of each feature during discrimination.
The weakness of applying RF to tectonic setting discrimination is that an
evaluation based only on a majority vote of multiple decision trees often makes the
quantitative geochemical interpretation of these elements and isotopic
ratios difficult. The authors suggest that the best quantitative discriminant is
SMLR, as it assigns to each sample a probability of belonging to a given
group (tectonic setting in this case), while still making it possible to identify the
importance of each feature. This tool is a notable step forward in the discrimination
of the geochemical signature of the different tectonic settings, which is commonly
assessed based on binary or ternary diagrams e.g., [133, 134]; these are useful with
many samples but are not able to differentiate a tectonic setting where a complex
evolution of magmas has occurred. In the last decade, multielement variation
diagrams e.g., [135], Decision Trees e.g., [136] and LDA e.g., [137] were also
proposed to accurately assign a tectonic setting based on rock geochemistry.
Based on rock sample geochemistry, [37] show that a set of 17 elements and isotopic
ratios is needed to clearly identify the tectonic setting. Two new discriminant
functions were recently proposed to discriminate the tectonic settings of mid-ocean
ridge (MOR) and oceanic plateau (OP). Ten datasets (original concentrations as well
as isometric log-ratio transformed variables; all 10 major elements as well as all 10
major and 6 trace elements) were used to evaluate the quality of discrimination
from LDA and canonical analysis [138].
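
For readers unfamiliar with the isometric log-ratio (ilr) transformation, the
sketch below computes it for one sample using so-called pivot coordinates, which
is one common choice of basis; the text does not state which basis was used in
[138], so this choice is an assumption. A composition of D strictly positive
parts is mapped to D-1 unconstrained coordinates, which can then be passed to
methods such as LDA.

```python
import math


def ilr_pivot(composition):
    """Isometric log-ratio (pivot coordinates) of a positive composition."""
    if any(x <= 0 for x in composition):
        raise ValueError("ilr needs strictly positive parts; treat zeros first")
    logs = [math.log(x) for x in composition]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]                 # parts following the i-th one
        r = len(rest)
        log_gmean_rest = sum(rest) / r      # log of their geometric mean
        coords.append(math.sqrt(r / (r + 1)) * (logs[i] - log_gmean_rest))
    return coords
```

The result depends only on the ratios between parts, so a sample expressed in
wt% and the same sample expressed in ppm give identical coordinates.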

The software package Compositional Data Package (CoDaPack) [139] and a
combination of unsupervised (CA) and supervised (LDA) learning approaches
were used by [36] to identify compositional variations of ignimbrite magmas in the
Central Andes, with the aim of using these methods as a tool for ignimbrite
correlation. The authors used the Statistica software [140] for both CA and LDA.

Correlating tephra and identifying their volcanic sources is a very difficult task, 
especially in areas where several volcanoes had explosive eruptions in a relatively 
short period of time. This is particularly challenging when volcanoes have similar 
geochemical and petrographic compositions. Electron microprobe analysis of glass 
compositions and whole-rock geochemical analyses are used frequently to make 
these correlations. However, correlations may not be so accurate when using only 
geochemical tools that may mask diagnostic variability; sometimes one of the most 
important advantages of ML in this regard is the speed at which correlations can 
be made, rather than the accuracy [35]. Other contributions, however, demonstrate
that ML techniques can also make these correlations accurate. Highly accurate
results of ML techniques applied to tephra correlation include those of LDA [141, 142]
and SVM e.g., [143]; however, SVM may fail in specific cases, and for the case study
of tephra from Alaska volcanoes the combination of ANN and RF is the best ML
technique to apply [35]. The authors use the R software [54] to apply these methods,
and they underline the advantage of producing probabilistic outputs. 

SOM was used as an unsupervised neural network approach to analyze geochemical
data of Ischia, Vesuvius and Campi Flegrei [144]. The advantage of this
method is that it needs no previous knowledge of geochemical or petrological
characteristics and that it allows the use of large databases with a large number
of variables. The SOM toolbox for Matlab [132] was used by [144] to perform two
tests: the first was based on major elements and selected trace elements to find
similar evolution processes; the second investigated the magmatic source, so a vector
containing a selection of ratios between major and trace elements was adopted.
One of the strengths of this method is that the resulting clusters made it possible to
differentiate rock samples that were otherwise distinguished comparably well only
by 2D diagrams of isotopic ratios; in other words, similar results were obtained
using only the less expensive geochemical data.

One of the applications of ML techniques that may be extremely useful in geochemistry
is the apparent possibility of predicting the concentration of unknown
elements if a large number of data of other elements is known. A combination of
ML techniques was used by [38] to predict Rare Earth Elements (REE) concentrations
in Ocean Island Basalts (OIB) using RF. They used 1283 analyses of which
80% were used for training and the remaining 20% to validate the results. They
found good estimations only for the Light Rare Earth Elements (LREE), suggesting
that the results may be improved by using a larger set of input data for training.
One possible solution may be the use of not only major elements for training but
also of other trace elements obtained through the same analytical method as the
major elements.
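
The hold-out procedure described above can be sketched as follows. The code only
shows the 80/20 split and a per-element score; it does not reproduce the RF model
of [38]. The function names, the random seed and the use of the coefficient of
determination as the score are assumptions made for illustration, and the exact
number of samples on each side of the split in [38] is not given here.

```python
import random


def holdout_split(n_samples, train_fraction=0.8, seed=0):
    """Return shuffled (train_indices, test_indices) for a hold-out test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_samples)
    return indices[:n_train], indices[n_train:]


def r2_score(observed, predicted):
    """Coefficient of determination between observed and predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Computing such a score separately for each predicted element on the held-out
samples is one way to see for which elements the estimates are reliable.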

The origin of the volcanoes in Northeast China, analyzed by RF and DNN using 
the full chemical compositional data, was associated with the Pacific slab, subducting
at Japan, reaching ~600-km depth under eastern China, and extending horizontally 
up to Mongolia. The boundary between volcanoes triggered by fluids and melts 
from the slab and those not related to it was located at the westernmost edge of the 
deeply buried Pacific slab [145]. 

As highlighted by [143], ML methods need to be integrated with other techniques
such as fieldwork, petrographic observations and classic geochemical studies
to obtain a clearer picture of the investigated problem. While in other fields it is
relatively easy (and cheap) to acquire large amounts of data (hundreds or more), this
is not the case for geochemistry. However, we underline that the application of ML
techniques to the geochemistry of volcanic rocks does need a minimum dataset size.
In the literature a set of 250 analyses is described as a sufficiently large amount of
data. As usual, one can try using the available data (often even fewer than 50), but
thousands of examples would definitely improve the results.


6. Applications of machine learning to other volcanological data 


ML appears more and more often in the volcanology literature, and its fields of
application now span other sub-disciplines as well.

Mount Erebus in Antarctica has a persistent lava lake showing Strombolian 
activity, but its location is definitely remote. Therefore, automatic methods to 
detect these explosions are highly needed. A CNN was trained using infrared
images captured from the crater rim and “labeled” with the help of accompanying
seismic data, which were no longer needed during the subsequent automatic
detection [146].

Clast morphology is a fundamental tool also for studies concerning volcanic 
textures. Texture analysis of clasts provides in particular information about genesis, 
transport and depositional processes. Here, ML has yet to be fully developed, but
the application of preprocessing techniques such as the Radon transform can be
a first step towards an efficient definition of feature vectors to be used for
classification, as shown e.g., at Colima volcano [147].

The Museum of Mineralogy, Petrography and Volcanology of the University of 
Catania implemented a communication system based on the visitor’s personal expe- 
rience to learn by playing. There is a web application called I-PETER: Interactive 
Platform to Experience Tours and Education on the Rocks. This platform includes 
a labeled dataset of images of rocks and minerals to be used also for petrological 
investigations based on ML [148]. 

Satellite remote sensing technology is increasingly used for monitoring the sur- 
face of the Earth in general, and volcanoes in particular, especially in areas where 
ground monitoring is scarce or completely missing. For instance, in Latin America 
202 out of 319 Holocene volcanoes did not have seismic, deformation or gas monitoring
in 2013 [7]. A complex-valued CNN was proposed to extract areas with land
shapes similar to given samples in interferometric synthetic aperture radar (InSAR), 
a technique widely applied in volcano monitoring. An application was presented 
grouping similar small volcanoes in Japan [149]. InSAR measurements have great 
potential for volcano monitoring, especially where images are freely available. ML 
methods can be used for the initial processing of single satellite data. Processing of 
potential unrest areas can then fully exploit integrated multi-disciplinary, multi- 
satellite datasets [7]. The Copernicus Programme of the European Space Agency 
(ESA) and the European Union (EU) has recently contributed by producing the 
Sentinel-2 multispectral satellites, able to provide high resolution satellite data
for disaster monitoring, as well as complementing previous satellite images like
Landsat. The free access policy also promotes an increasing use of Sentinel-2 data, 
which is often processed by ML techniques such as SVM and RF [150]. A transfer 
learning strategy was applied to ground deformation in Sentinel-1 data [151] and a
range of pretrained networks was tested, finding that AlexNet [152] is best suited to 
this task. The positive results were checked by a researcher and fed back for model 
updating. 

The global volcano monitoring platform MOUNTS (Monitoring Unrest from 
Space) uses multisensor satellite-based imagery (Sentinel-1 Synthetic Aperture 
Radar SAR, Sentinel-2 Short-Wave InfraRed SWIR, Sentinel-5P TROPOMI), 
ground-based seismic data (GEOFON and USGS global earthquake catalogs), and 
CNN to provide support for volcanic risk assessment. Results are visualized on an 
open-access website. The efficiency of the system was tested on several eruptions 
(Erta Ale 2017, Fuego 2018, Kilauea 2018, Anak Krakatau 2018, Ambrym 2018, and 
Piton de la Fournaise 2018-2019) [153]. 

Debris flows are among the most widespread and dangerous natural
processes, not only on volcanoes but more generally in mountainous environments.
A methodology was recently proposed [154] that combines the results of
deterministic and heuristic/probabilistic models for susceptibility assessment. RF
models are extensively used to represent the heuristic/probabilistic component
of the modeling. The case study presented is the Changbai Shan volcano,
China [154].

Mapping lava flows from satellite is another important remote sensing applica- 
tion. RF was applied to 20 individual flows and 8 groups of flows of similar age 
using a Landsat 8 image and a DEM of Nyamuragira (Congo) with 30 m resolution. 
Despite spectral similarity, lava flows of contrasting age can be well discriminated 
and mapped by means of image classification [155]. 

The hazard related to landslides at volcanoes is also significant. DNN models 
were proposed for landslide susceptibility assessment in Viet Nam, showing considerably
better performance with respect to other ML methods such as MLP, SVM, DT
and RF [156]. The DNN approach could therefore be of interest
for the landslide susceptibility mapping of active volcanoes.

Muon imaging has been successfully used by geophysicists to investigate the 
internal structure of volcanoes, for example at Etna (Italy) [157]. Muon imaging is 
essentially an inverse problem and it can profit from the application of ML tech- 
niques, such as ANN and CA [158]. 

Combinations of supervised and unsupervised ML techniques have been used to 
map volcanoes also on other planets. An ML paradigm was designed for the identification
of volcanoes on Venus [159]. Other studies have used topographic data,
such as DEM and associated derivatives obtained from orbital images, to detect and 
classify manually labeled Martian landforms including volcanoes [160]. 


7. Conclusions


ML techniques will have an increasing impact on how we study and model 
volcanoes in all their aspects, how we monitor them and how we evaluate their 
hazards, both in the short and in the long term. The increasing amount of monitoring
equipment installed on volcanoes on the one hand provides more and more data,
but on the other often makes their real-time processing unfeasible, especially when
most needed i.e., during unrest and eruptions. Here ML will be most useful,
as it can provide the tools to sift through big data to identify subtle
patterns that could indicate unrest, hopefully well before eruptions. One important
issue is that of generalization. We must move towards the construction of ML
models that can be applied on different volcanoes, for instance when previous data 
is not available for training specific models. The concepts of transfer learning can 
be important here. 

The routine use of ML tools at the different volcano observatories should be 
promoted by providing easy installation procedures and easy integration into 
existing monitoring systems. Open source software should always be chosen
whenever possible. On the other hand, observatories should provide good open
training data to ML developers, researchers and data scientists in order to improve
the models in a virtuous circle. The easy availability of open access data, both from
the ground and from satellites, should be exploited for building reliable training
sets in the different fields of volcanology. This will allow “scientific competition”
between research groups using different ML approaches and make a direct comparison
of results easier, as is common in other disciplines where “standard”
training datasets are available for download to everybody.